Doublet method for very fast autocoding

Author

  • Jules J. Berman
Abstract

BACKGROUND: Autocoding (or automatic concept indexing) occurs when a software program extracts terms contained within text and maps them to a standard list of concepts contained in a nomenclature. The purpose of autocoding is to provide a way of organizing large documents by the concepts represented in the text. Because textual data accumulates rapidly in biomedical institutions, the computational methods used to autocode text must be very fast. The purpose of this paper is to describe the doublet method, a new algorithm for very fast autocoding.

METHODS: An autocoder was written that transforms plain text into intercalated word doublets (e.g. "The ciliary body produces aqueous humor" becomes "The ciliary, ciliary body, body produces, produces aqueous, aqueous humor"). Each doublet is checked against an index of doublets extracted from a standard nomenclature. Matching doublets are assigned a numeric code specific for each doublet found in the nomenclature. Text doublets that do not match the index of doublets extracted from the nomenclature are not part of valid nomenclature terms. Runs of matching doublets from text are concatenated and matched against nomenclature terms (also represented as runs of doublets).

RESULTS: The doublet autocoder was compared for speed and performance against a previously published phrase autocoder. Both autocoders are Perl scripts, and both autocoders used an identical text (a 170+ Megabyte collection of abstracts collected through a PubMed search) and the same nomenclature (neocl.xml, containing over 102,271 unique names of neoplasms). In side-by-side comparison on the same computer, the doublet method autocoder was 8.4 times faster than the phrase autocoder (211 seconds versus 1,776 seconds). The doublet method codes 0.8 Megabytes of text per second on a desktop computer with a 1.6 GHz processor. In addition, the doublet autocoder successfully matched terms that were missed by the phrase autocoder, while the phrase autocoder found no terms that were missed by the doublet autocoder.

CONCLUSIONS: The doublet method of autocoding is a novel algorithm for rapid text autocoding. The method will work with any nomenclature and will parse any ASCII plain text. An implementation of the algorithm in Perl is provided with this article. The algorithm, the Perl implementation, the neoplasm nomenclature, and Perl itself are all open source materials.
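
The doublet transformation and run matching described in the abstract can be illustrated with a minimal sketch. This is not the Perl script distributed with the article. It is a Python illustration under stated assumptions: the function names (`to_words`, `to_doublets`, `build_index`, `autocode`), the two-term toy nomenclature, and the codes `C0001` and `C0002` are all hypothetical. The sketch also simplifies the method: it tests doublets for membership in a set instead of assigning each doublet a numeric code, it matches only whole runs against nomenclature terms, and it ignores single-word terms.

```python
import re


def to_words(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def to_doublets(words):
    """Return the overlapping word pairs (doublets) of a word list, in order."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def build_index(nomenclature):
    """Collect every doublet that occurs in any nomenclature term."""
    index = set()
    for term in nomenclature:
        index.update(to_doublets(to_words(term)))
    return index


def autocode(text, nomenclature):
    """Return (term, code) pairs for runs of matching doublets that spell a term.

    `nomenclature` maps term strings to codes (hypothetical format).
    """
    index = build_index(nomenclature)
    terms = {" ".join(to_words(t)): code for t, code in nomenclature.items()}
    found = []
    run = []  # words covered by the current run of matching doublets

    def flush():
        # A finished run counts only if it is exactly a nomenclature term.
        phrase = " ".join(run)
        if phrase in terms:
            found.append((phrase, terms[phrase]))
        run.clear()

    words = to_words(text)
    for a, b in zip(words, words[1:]):
        if f"{a} {b}" in index:
            if not run:
                run.append(a)
            run.append(b)
        elif run:
            flush()  # a non-matching doublet ends the current run
    if run:
        flush()
    return found


# Toy nomenclature with made-up codes, for illustration only.
toy = {"ciliary body": "C0001", "aqueous humor": "C0002"}
sentence = "The ciliary body produces aqueous humor"

print(to_doublets(to_words(sentence)))
# ['the ciliary', 'ciliary body', 'body produces', 'produces aqueous', 'aqueous humor']
print(autocode(sentence, toy))
# [('ciliary body', 'C0001'), ('aqueous humor', 'C0002')]
```

In this sketch, a doublet that is absent from the index ends the current run immediately, so words outside the nomenclature are never compared against full terms. That early rejection is the property the abstract relies on when it says non-matching doublets "are not part of valid nomenclature terms".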

Similar resources

Very Fast Load Flow Calculation Using Fast-Decoupled Reactive Power Compensation Method for Radial Active Distribution Networks in Smart Grid Environment Based on Zooming Algorithm

Distribution load flow (DLF) calculation is one of the most important tools in distribution networks. DLF tools must be able to perform fast calculations in real-time studies in the presence of distributed generators (DGs) in a smart grid environment, even when the network topology changes. In this paper, a new method for DLF in radial active distribution networks is proposed. The ...

Rupture characteristics of the 2012 earthquake doublet in Ahar-Varzagan region using the Empirical Green Function method

On August 11, 2012, within several minutes, two shallow destructive earthquakes with moment magnitudes of 6.5 and 6.4 occurred in Varzagan, Azerbaijan-e-Sharghi Province, in the northwest of Iran. In this study, the Empirical Green Function (EGF) method was used for strong ground motion simulation to estimate the source parameters and rupture characteristics of the earthquakes. To simulate the fir...

Lateral Vibrations of Single-Layered Graphene Sheets Using Doublet Mechanics

This paper investigates the lateral vibration of single-layered graphene sheets based on a new theory called doublet mechanics with a length scale parameter. After a brief review of doublet mechanics fundamentals, a sixth-order partial differential equation that governs the lateral vibration of single-layered graphene sheets is derived. Using doublet mechanics, the relation between natural f...

Very Fast Field Oriented Control for Permanent Magnet Hysteresis Synchronous Motor

In this paper, a new field oriented control scheme with maximum torque for the permanent magnet hysteresis synchronous (PMHS) motor is presented. The vector control method provides significant improvement to the dynamic performance of AC motors, but in this method the d-axis current is controlled such that the ratio of motor torque to motor current is a maximum; then the dynamic performance will be ver...

Nomenclature-based data retrieval without prior annotation: facilitating biomedical data integration with fast doublet matching

Assigning nomenclature codes to biomedical data is an arduous, expensive and error-prone task. Data records are coded to provide a common representation of contained concepts, allowing facile retrieval of records via a standard terminology. In the medical field, cancer registrars, nurses, pathologists, and private clinicians all understand the importance of annotating medical records with vo...

Journal title:
  • BMC Medical Informatics and Decision Making

Volume 4  Issue 

Pages  -

Publication date 2004